{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Preprocessing\n",
    "\n",
    "Haystack includes a suite of tools to extract text from different file types, normalize whitespace,\n",
    "and split text into smaller pieces to optimize retrieval.\n",
    "These data preprocessing steps can have a big impact on the system's performance, and handling data effectively is key to getting the most out of Haystack."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Ultimately, Haystack expects data to be provided as a list of documents in the following dictionary format:\n",
    "``` python\n",
    "docs = [\n",
    "    {\n",
    "        'content': DOCUMENT_TEXT_HERE,\n",
    "        'meta': {'name': DOCUMENT_NAME, ...}\n",
    "    }, ...\n",
    "]\n",
    "```"
   ]
  },
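  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of what this format looks like in practice, the snippet below builds such a list by hand from in-memory strings. The helper `to_document_dicts` and the sample names and texts are hypothetical and are not part of Haystack; only the `content` and `meta` keys come from the format above.\n",
    "\n",
    "``` python\n",
    "def to_document_dicts(named_texts):\n",
    "    # named_texts is an assumed mapping of document name -> raw text.\n",
    "    # Each entry becomes one dictionary with the 'content' and 'meta' keys.\n",
    "    return [{'content': text, 'meta': {'name': name}} for name, text in named_texts.items()]\n",
    "\n",
    "\n",
    "docs = to_document_dicts({'a.txt': 'First text.', 'b.txt': 'Second text.'})\n",
    "print(docs[0])\n",
    "```\n",
    "\n",
    "Any additional metadata you want to keep alongside a document can be added as further keys in its `meta` dictionary."
   ]
  },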
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "This tutorial will show you all the tools that Haystack provides to help you cast your data into this format."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installing Haystack\n",
    "\n",
    "To start, let's install the latest release of Haystack with `pip`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "pip install --upgrade pip\n",
    "pip install farm-haystack[colab,ocr,preprocessing,file-conversion,pdf]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Enabling Telemetry\n",
    "Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from haystack.telemetry import tutorial_running\n",
    "\n",
    "tutorial_running(8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Logging\n",
    "\n",
    "We configure how logging messages should be displayed and which log level should be used before importing Haystack.\n",
    "Example log message:\n",
    "\n",
    "`INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt`\n",
    "\n",
    "The default log level in `basicConfig` is WARNING, so the explicit `level` parameter is not necessary, but it makes the level easy to change:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import logging\n",
    "\n",
    "logging.basicConfig(format=\"%(levelname)s - %(name)s -  %(message)s\", level=logging.WARNING)\n",
    "logging.getLogger(\"haystack\").setLevel(logging.INFO)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from haystack.utils import fetch_archive_from_http\n",
    "\n",
    "\n",
    "# This fetches some sample files to work with\n",
    "doc_dir = \"data/tutorial8\"\n",
    "s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial8.zip\"\n",
    "fetch_archive_from_http(url=s3_url, output_dir=doc_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Converters\n",
    "\n",
    "Haystack's converter classes are designed to help you turn files on your computer into the documents\n",
    "that can be processed by the Haystack pipeline.\n",
    "There are file converters for txt, pdf, and docx files, as well as a converter that is powered by Apache Tika.\n",
    "The `valid_languages` parameter does not translate files into the target language. Instead, it checks whether the extracted text is in one of the listed languages, which helps detect conversions that produced garbled text. Here are some examples of how you would use file converters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor\n",
    "\n",
    "\n",
    "converter = TextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n",
    "doc_txt = converter.convert(file_path=\"data/tutorial8/classics.txt\", meta=None)[0]\n",
    "\n",
    "converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n",
    "doc_pdf = converter.convert(file_path=\"data/tutorial8/bert.pdf\", meta=None)[0]\n",
    "\n",
    "converter = DocxToTextConverter(remove_numeric_tables=False, valid_languages=[\"en\"])\n",
    "doc_docx = converter.convert(file_path=\"data/tutorial8/heavy_metal.docx\", meta=None)[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Haystack also has a convenience function that will automatically apply the right converter to each file in a directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from haystack.utils import convert_files_to_docs\n",
    "\n",
    "\n",
    "all_docs = convert_files_to_docs(dir_path=doc_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## PreProcessor\n",
    "\n",
    "The PreProcessor class is designed to help you clean text and split text into sensible units.\n",
    "File splitting can have a very significant impact on the system's performance and is absolutely mandatory for Dense Passage Retrieval models.\n",
    "In general, we recommend you split the text from your files into small documents of around 100 words for dense retrieval methods\n",
    "and no more than 10,000 words for sparse methods.\n",
    "Have a look at the [Preprocessing](https://docs.haystack.deepset.ai/docs/preprocessor)\n",
    "and [Optimization](https://docs.haystack.deepset.ai/docs/optimization) pages on our website for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from haystack.nodes import PreProcessor\n",
    "\n",
    "\n",
    "# This is a default usage of the PreProcessor.\n",
    "# Here, it cleans up consecutive whitespace\n",
    "# and splits a single large document into smaller documents.\n",
    "# Each document is at most about 100 words long, and document breaks cannot fall in the middle of a sentence.\n",
    "# Note how the single document passed into the PreProcessor gets split into several smaller documents.\n",
    "\n",
    "preprocessor = PreProcessor(\n",
    "    clean_empty_lines=True,\n",
    "    clean_whitespace=True,\n",
    "    clean_header_footer=False,\n",
    "    split_by=\"word\",\n",
    "    split_length=100,\n",
    "    split_respect_sentence_boundary=True,\n",
    ")\n",
    "docs_default = preprocessor.process([doc_txt])\n",
    "print(f\"n_docs_input: 1\\nn_docs_output: {len(docs_default)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Cleaning\n",
    "\n",
    "- `clean_empty_lines` will normalize 3 or more consecutive empty lines to just two empty lines\n",
    "- `clean_whitespace` will remove any whitespace at the beginning or end of each line in the text\n",
    "- `clean_header_footer` will remove any long header or footer texts that are repeated on each page"
   ]
  },
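  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the first two options concrete, here is a rough pure-Python sketch that follows the descriptions above literally. The function `clean_lines` and the sample input are hypothetical; this is not how the PreProcessor is implemented, and the library's exact behavior may differ.\n",
    "\n",
    "``` python\n",
    "def clean_lines(lines):\n",
    "    # Assumption: the text has already been split into a list of lines.\n",
    "    cleaned = []\n",
    "    empty_run = 0\n",
    "    for line in lines:\n",
    "        # Remove whitespace at the beginning and end of the line (cf. clean_whitespace).\n",
    "        line = line.strip()\n",
    "        if line == '':\n",
    "            empty_run += 1\n",
    "            # Keep at most two consecutive empty lines (cf. clean_empty_lines).\n",
    "            if empty_run > 2:\n",
    "                continue\n",
    "        else:\n",
    "            empty_run = 0\n",
    "        cleaned.append(line)\n",
    "    return cleaned\n",
    "\n",
    "\n",
    "raw = ['  first line  ', '', '', '', '', 'second line ']\n",
    "print(clean_lines(raw))\n",
    "```\n",
    "\n",
    "Stripping each line before counting empty lines means that lines containing only spaces or tabs are treated as empty as well."
   ]
  },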
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Splitting\n",
    "By default, the PreProcessor will respect sentence boundaries, meaning that documents will not start or end\n",
    "midway through a sentence.\n",
    "This will help reduce the possibility of answer phrases being split between two documents.\n",
    "This feature can be turned off by setting `split_respect_sentence_boundary=False`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Not respecting sentence boundary vs respecting sentence boundary\n",
    "\n",
    "preprocessor_nrsb = PreProcessor(split_respect_sentence_boundary=False)\n",
    "docs_nrsb = preprocessor_nrsb.process([doc_txt])\n",
    "\n",
    "print(\"RESPECTING SENTENCE BOUNDARY\")\n",
    "end_text = docs_default[0].content[-50:]\n",
    "print('End of document: \"...' + end_text + '\"')\n",
    "print()\n",
    "print(\"NOT RESPECTING SENTENCE BOUNDARY\")\n",
    "end_text_nrsb = docs_nrsb[0].content[-50:]\n",
    "print('End of document: \"...' + end_text_nrsb + '\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "A commonly used strategy to split long documents, especially in the field of Question Answering,\n",
    "is the sliding window approach. If `split_length=10` and `split_overlap=3`, your documents will look like this:\n",
    "\n",
    "- doc1 = words[0:10]\n",
    "- doc2 = words[7:17]\n",
    "- doc3 = words[14:24]\n",
    "- ...\n",
    "\n",
    "You can use this strategy by following the code below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Sliding window approach\n",
    "\n",
    "preprocessor_sliding_window = PreProcessor(split_overlap=3, split_length=10, split_respect_sentence_boundary=False)\n",
    "docs_sliding_window = preprocessor_sliding_window.process([doc_txt])\n",
    "\n",
    "doc1 = docs_sliding_window[0].content[:200]\n",
    "doc2 = docs_sliding_window[1].content[:100]\n",
    "doc3 = docs_sliding_window[2].content[:100]\n",
    "\n",
    "print('Document 1: \"' + doc1 + '...\"')\n",
    "print('Document 2: \"' + doc2 + '...\"')\n",
    "print('Document 3: \"' + doc3 + '...\"')"
   ]
  },
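  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the index arithmetic of the sliding window explicit, here is a minimal pure-Python sketch. The function `sliding_window` and the synthetic word list are hypothetical and only mirror the `words[0:10]`, `words[7:17]`, `words[14:24]` pattern described above; the PreProcessor's own implementation may handle details such as the final window differently.\n",
    "\n",
    "``` python\n",
    "def sliding_window(words, split_length, split_overlap):\n",
    "    # Each window starts split_length - split_overlap words after the previous one.\n",
    "    stride = split_length - split_overlap\n",
    "    if stride <= 0:\n",
    "        raise ValueError('split_overlap must be smaller than split_length')\n",
    "    windows = []\n",
    "    for start in range(0, len(words), stride):\n",
    "        windows.append(words[start:start + split_length])\n",
    "        # Stop once a window reaches the end of the text.\n",
    "        if start + split_length >= len(words):\n",
    "            break\n",
    "    return windows\n",
    "\n",
    "\n",
    "words = ['w' + str(i) for i in range(24)]\n",
    "windows = sliding_window(words, split_length=10, split_overlap=3)\n",
    "for window in windows:\n",
    "    print(window[0], 'to', window[-1])\n",
    "```\n",
    "\n",
    "The last three words of each window are repeated as the first three words of the next one, so a phrase that straddles a boundary still appears whole in at least one document as long as it is no longer than the overlap."
   ]
  },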
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Bringing it all together"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "all_docs = convert_files_to_docs(dir_path=doc_dir)\n",
    "preprocessor = PreProcessor(\n",
    "    clean_empty_lines=True,\n",
    "    clean_whitespace=True,\n",
    "    clean_header_footer=False,\n",
    "    split_by=\"word\",\n",
    "    split_length=100,\n",
    "    split_respect_sentence_boundary=True,\n",
    ")\n",
    "docs = preprocessor.process(all_docs)\n",
    "\n",
    "print(f\"n_files_input: {len(all_docs)}\\nn_docs_output: {len(docs)}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.10.6 64-bit",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  },
  "vscode": {
   "interpreter": {
    "hash": "bda33b16be7e844498c7c2d368d72665b4f1d165582b9547ed22a0249a29ca2e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
